ABSTRACT 

The coarsest approximation of the structure of a complex 
network, such as the Internet, is a simple undirected un- 
weighted graph. This approximation, however, loses too 
much detail. In reality, objects represented by vertices and 
edges in such a graph possess some non-trivial internal struc- 
ture that varies across and differentiates among distinct types 
of links or nodes. In this work, we abstract such additional 
information as network annotations. We introduce a net- 
work topology modeling framework that treats annotations 
as an extended correlation profile of a network. Assuming 
we have this profile measured for a given network, we present 
an algorithm to rescale it in order to construct networks of 
varying size that still reproduce the original measured an- 
notation profile. 

Using this methodology, we accurately capture the net- 
work properties essential for realistic simulations of net- 
work applications and protocols, or any other simulations 
involving complex network topologies, including modeling 
and simulation of network evolution. We apply our approach 
to the Autonomous System (AS) topology of the Internet 
annotated with business relationships between ASs. This 
topology captures the large-scale structure of the Internet. 
In-depth understanding of this structure and tools to model 
it are cornerstones of research on future Internet architec- 
tures and designs. We find that our techniques are able 
to accurately capture the structure of annotation correla- 
tions within this topology, thus reproducing a number of 
its important properties in synthetically-generated random 
graphs. 

Categories and Subject Descriptors 

C.2.1 [Network Architecture and Design]: Network 
topology; C.2.5 [Local and Wide-Area Networks]: In- 
ternet; G.3 [Probability and Statistics]: Distribution 
functions, multivariate statistics, correlation and regression 
analysis; G.2.2 [Graph Theory]: Network problems 
1. INTRODUCTION 

Simulations of new network protocols and architectures 
are pointless without realistic models of network structure 
and evolution. Performance of routing [26], multicast [34], 
and other protocols depends crucially on network topol- 
ogy. Simulations of these protocols with inaccurate topology 
models can thus result in misleading outcomes. 

Inaccuracies associated with representing complex net- 
work topologies as simple undirected unweighted graphs come 
not only from potential sampling biases in topology mea- 
surements [27, 10, 11], but also from neglecting link and 
node annotations. By annotations we mean various types 
of links and nodes that abstract their intrinsic structural and 
functional differences to a certain degree. For example, con- 
sider the Internet topology at the Autonomous System (AS) 
level. Here, link annotations may represent different business 
relationships between ASs, e.g., customer-to-provider, 
peer-to-peer, etc. [13], while node annotations may represent 
different types of ASs, e.g., large or small Internet Service 
Providers (ISPs), exchange points, universities, customer en- 
terprises, etc. [15]. In router-level Internet topologies, link 
annotations can be different transmission speeds, latencies, 
packet loss rates, etc. One can also differentiate between 
distinct types of links and nodes in other networks, such 
as social, biological, or transportation networks. In many 
cases, simply reproducing the structure of a given network is 
insufficient; we must also understand and reproduce domain- 
specific annotations. 

We propose network annotations as a general framework 
to provide the next level of detail describing the "micro- 
scopic" structure of links and nodes. Clearly, since links 
and nodes are constituents of a global network, increasing 
description accuracy at the "microscopic" level will also increase 
overall accuracy at the "macroscopic" level. 
That is, including appropriate per-node or per-link anno- 
tations will allow us to capture and reproduce more accu- 
rately a variety of important global graph properties. In 
the AS topology case, for example, instead of considering 
only shortest paths, we will be able to study the structure 
of paths that respect constraints imposed by routing policies 
and AS business relationships. 

Higher accuracy in approximating network structure is 
desirable not only for studying applications and protocols 
that depend on such structure, but also for modeling network 
evolution. For example, realistic Internet AS topology 
growth models should be based on economic realities of the 
Internet since AS links are nothing but reflections of AS 
contractual relationships, i.e., results of business decisions 
made by organizations that the corresponding ASs repre- 
sent. Therefore, economy-based AS topology models nat- 
urally produce links annotated with AS relationships. AS 
relationship annotations are thus intrinsic to such models. 

Network annotations should also be useful for researchers 
studying only those networks that preserve some domain- 
specific constraints, thus avoiding "too random" networks 
that violate these constraints. Examples of such "techno- 
logical" constraints for router topologies include maximum 
node degree limits, specific relationships between node de- 
gree and centrality, etc. [28]. In this context, we note that 
any node or link attributes, including their degrees and cen- 
trality, are forms of annotations. Therefore, one can use the 
network annotation framework to introduce domain-specific 
or any other constraints to work with network topologies 
narrowed down to a specific class. We also note that the 
network annotation framework is sufficiently general to include 
directed and weighted networks as special cases, since 
both link directions and weights are forms of annotations. 

After reviewing, in Section 2, past work on network topol- 
ogy modeling and generation, which largely ignores annota- 
tions, we make the following contributions in this paper: 

• In Section 3, we demonstrate the importance of net- 
work annotations using the specific example of AS 
business relationships in the Internet. 

• In Section 4, we introduce a general network annota- 
tion formalism and apply it to the Internet AS topol- 
ogy annotated with AS business relationships. 

• In Section 5, we formulate a general methodology and 
specific algorithms to: i) rescale the annotation corre- 
lation profile of the observed AS topology to arbitrary 
network sizes; and ii) construct synthetic networks 
reproducing the rescaled annotation profiles. While 
we discuss our graph rescaling and construction tech- 
niques in the specific context of AS topologies, these 
techniques are generic and can be used for generating 
synthetic annotated networks that model other com- 
plex systems. 

• In Section 6, we evaluate the properties of the resulting 
synthetic AS topologies and show that they recreate 
the annotation correlations observed in real annotated 
AS topologies as well as other important properties 
directly related to common metrics used in simulation 
and performance evaluation studies. 

We conclude by outlining some implications and directions 
for future work in Section 7. 

2. RELATED WORK 

A large number of works have focused on modeling Internet 
topologies and on developing realistic topology generators. 
Waxman [44] introduced the first topology generator that 
became widely known. The Waxman generator was based on the 
classical (Erdős-Rényi) random graph model [18]. After it 
became evident that observed networks have little in common 
with classical random graphs, new generators like GT-ITM [46] 
and Tiers [17] tried to mimic the perceived hierarchical 
network structure and were consequently called structural. 
In 1999, Faloutsos et al. [19] discovered that the degree 
distributions of router- and AS-level topologies of the 
Internet followed a power law. Structural generators failed 
to reproduce the observed power laws. This failure led to a 
number of subsequent works trying to resolve the problem. 

The existing topology models capable of reproducing power 
laws can be roughly divided into the following two classes: 
causality-aware and causality-oblivious. The first class includes 
the Barabási-Albert (BA) [2] preferential attachment 
model, the Highly Optimized Tolerance (HOT) model [6], 
and their derivatives. The BRITE [32] topology generator 
belongs to this class, as it employs preferential attachment 
mechanisms to generate synthetic Internet topologies. The 
models in this class grow a network by incrementally adding 
nodes and links to a graph based on a formalized network 
evolution process. One can show that both BA and HOT 
growth mechanisms produce power laws. 

On the other hand, the causality-oblivious approaches try 
to match a given (power-law) degree distribution without 
accounting for different forces that might have driven evo- 
lution of a network to its currently observed state. The 
models in this class include random graphs with given ex- 
pected [9] and exact [1] degree sequences, Markov graph 
rewiring models [31, 23], and the Inet [45] topology gen- 
erator. Recent work by Mahadevan et al. introduced the 
dK-series [29] extending this class of models to account for 
node degree correlations of arbitrary order. Whereas the 
dK-series provides a set of increasingly accurate descriptions 
of network topologies represented as graphs, network 
annotations are another way, independent of and "orthogonal" 
to the dK-series, to provide more accurate and complete 
information about the actual complex systems that these 
graphs represent. 

Frank and Strauss first formally introduced the annotated 
(colored) random Markov graphs in [20]. In their definition, 
every edge is colored by one of T colors. More recently, 
Söderberg suggested a slightly different definition [42], where 
every half-edge, i.e., stub, is colored by one of T colors. 
Every edge is thus characterized by a pair of colors. This 
definition is very generic. It includes uncolored and 
standard colored [20] random graphs, random vertex-colored 
graphs [40]¹, and random directed graphs [5] as special cases. 
Söderberg considers some analytic properties of the ensemble 
of these random colored graphs in [41]. In [43], he observes 
strong similarities between random graphs colored by 
T colors and random Feynman graphs representing a perturbative 
description of a T-dimensional system from quantum 
or statistical mechanics. 

Recent works on annotation techniques specific to AS 
graphs include [12] and [8]. The GHITLE [12] topology gen- 
erator produces AS topologies with c2p and p2p annota- 
tions based on simple design heuristics and user-controlled 
parameters. The work by Chang et al. [8] describes a topol- 
ogy evolution framework that models ASs' decision criteria 
in establishing c2p and p2p relationships. Our methodol- 
ogy is different in that it explores the orthogonal, causality- 
oblivious approach to modeling link annotations. Its main 
advantage is that it is applicable to modeling any type of 
complex networks. 

¹ Random graphs with colored nodes are a special case of 
random graphs with hidden variables [3]. 
Figure 1: Example AS topology annotated with AS 
relationships. The dotted lines represent shortest 
paths from ASs 4, 6, and 8 to AS 2. The dashed 
lines represent policy-compliant paths from the same 
sources to the same destination. 

3. AS RELATIONSHIPS AND WHY THEY 
MATTER 

In this section, we introduce our specific example of net- 
work annotations — AS relationships. We first describe what 
AS relationships represent and then discuss the results of 
simple simulation experiments showing why preserving AS 
relationship information is important. 

AS relationships are annotations of links of the Internet 
AS-level topology. They represent business agreements be- 
tween pairs of AS neighbors. There are three major types of 
AS relationships: 1) customer-to-provider (c2p), connecting 
customer and provider ASs; 2) peer-to-peer (p2p), connect- 
ing two peer ASs; and 3) sibling-to-sibling (s2s), connecting 
two sibling ASs. This classification stems from the following 
BGP route export policies, dictated by business agreements 
between ASs: 

• exporting routes to a provider or a peer, an AS ad- 
vertises its local routes and routes received from its 
customer ASs only; 

• exporting routes to a customer or a sibling, an AS 
advertises all its routes, i.e., its local routes and routes 
received from all its AS neighbors. 

Even though there are only two distinct export policies, they 
lead to the three different AS relationship types when com- 
bined in an asymmetric (c2p) or symmetric (p2p or s2s) 
manner. 
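
The two export policies above can be written as a small filter. The following is only an illustrative sketch: the `Route` record, the relationship labels, and the function name `routes_to_export` are hypothetical names chosen for this example and do not come from any BGP implementation.

```python
from dataclasses import dataclass

# Relationship labels, always from the point of view of the exporting AS.
# These names are assumptions made for this sketch.
CUSTOMER, PROVIDER, PEER, SIBLING = "customer", "provider", "peer", "sibling"


@dataclass(frozen=True)
class Route:
    prefix: str
    learned_from: str  # "local" or one of the relationship labels above


def routes_to_export(routes, neighbor):
    """Return the routes an AS advertises to a neighbor of the given type."""
    if neighbor in (PROVIDER, PEER):
        # To a provider or peer: local routes and customer routes only.
        return [r for r in routes if r.learned_from in ("local", CUSTOMER)]
    if neighbor in (CUSTOMER, SIBLING):
        # To a customer or sibling: all routes.
        return list(routes)
    raise ValueError(f"unknown neighbor type: {neighbor!r}")


table = [
    Route("10.0.0.0/8", "local"),
    Route("10.1.0.0/16", CUSTOMER),
    Route("10.2.0.0/16", PEER),
    Route("10.3.0.0/16", PROVIDER),
]
print([r.prefix for r in routes_to_export(table, PEER)])
```

In this sketch a c2p link corresponds to the two ends applying different branches of the filter, while p2p and s2s links apply the same branch at both ends.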

If all ASs strictly adhere to these export policies, then 
one can easily check [21] that every AS path must be of 
the following valley-free or valid pattern: zero or more c2p 
links, followed by zero or one p2p links, followed by zero or 
more p2c links, where by 'p2c' links we mean c2p links in 
the direction from the provider to the customer. 
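
As a minimal sketch of the valley-free pattern just stated, the check below walks a path given as a sequence of link types read in the direction of travel. The string labels and the function name `is_valley_free` are assumptions made for illustration.

```python
def is_valley_free(link_types):
    """Check the pattern: zero or more c2p, at most one p2p, zero or more p2c.

    Each element of link_types is "c2p", "p2p", or "p2c", read in the
    direction of travel along the AS path.
    """
    # phase 0: still climbing (c2p links); 1: the single p2p link was used;
    # phase 2: descending (p2c links).
    phase = 0
    for t in link_types:
        if t == "c2p":
            if phase != 0:
                return False  # climbing again after a p2p or p2c link
        elif t == "p2p":
            if phase != 0:
                return False  # second p2p link, or p2p after descending
            phase = 1
        elif t == "p2c":
            phase = 2
        else:
            raise ValueError(f"unknown link type: {t!r}")
    return True


print(is_valley_free(["c2p", "c2p", "p2p", "p2c"]))  # valid
print(is_valley_free(["p2c", "c2p"]))                # a "valley": invalid
```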

Routing policies reflect business agreements and economic 
incentives. For this reason, they are deemed more important 
than quality of service and other criteria. As a result, sub- 
optimal routing and inflated AS paths often occur. Gao and 
Wang [22] used BGP data to measure the extent of AS path 
inflation in the Internet. They found that at least 45% of 
the AS paths observed in BGP data are inflated by at least 
one AS hop and that AS paths can be inflated by as many as 
9 AS hops. 

Table 1: Total number of paths for each AS with AS 
relationships enabled and AS relationships disabled. 

Taking into account such inflation effects is important for 
meaningful and realistic simulation studies. For example, 
consider the AS topology in Figure 1, which is a small part 
of the real (measured) AS topology annotated with AS re- 
lationships inferred using heuristics in [14]. Directed links 
represent c2p relationships that point towards the provider 
and undirected links represent p2p relationships. If we ig- 
nore AS relationships then the shortest paths from ASs 4, 
6, and 8 to AS 2 are shown with dotted lines. On the other 
hand, if we account for AS relationships these paths are no 
longer valid. In particular, the path 4→3→2 traverses two 
p2p links; the path 6→3→2 traverses a p2c link followed 
by a p2p link; and the path 8→1→2 traverses a c2p link 
after having gone through a p2c link. As all these paths are 
not valid, they are not used in practice. The paths actually 
used are the policy-compliant paths marked with dashed 
lines. 

In other words, the first effect of taking AS relationships 
into account is that paths become longer than the corre- 
sponding shortest paths. From a performance perspective, 
longer paths can affect metrics such as end-to-end (e2e) de- 
lay, server response time, jitter, convergence time, and oth- 
ers. 

To illustrate this effect, we simulated the topology in 
Figure 1 using BGP++ [16]. We used a single router per AS and 
configured appropriate export rules between ASs according 
to the guidelines discussed above. We set the delay of each 
link to 10 milliseconds and the bandwidth to 400 kbps. Then, 
we configured exponential on/off traffic sources at ASs 4, 6, 
and 8 that send traffic to AS 2 at a rate of 500 kbps. We 
ran the simulation for 120 seconds; for the first 100 seconds 
we waited for routers to converge² and at the 100th second 
we started the traffic sources. We then measured the 
e2e delay between the sources and the destination with AS 
relationships disabled and enabled. 

In Figure 2 we depict the cumulative distribution function 
(CDF) of the e2e delays for both cases. We first 
notice that the CDF with AS relationships enabled shifts to 
the right, which means that there is a significant increase 
in the e2e delay. In particular, the average e2e delay with 
AS relationships enabled is 0.853 seconds, whereas without 
AS relationships it drops to 0.389 seconds. Besides the increase 
in the e2e delay, we see that the CDF with AS relationships 
is much smoother than the other CDF, which 
exhibits a step-wise increase. The reason for that difference 
is that in the former case we have more flows sharing multi- 
ple queues and, consequently, more diverse queue dynamics, 
while in the latter case, almost all paths are disjoint, leading 
to mostly fixed e2e delays. The observed difference signifies 
that the e2e delay with AS relationships enabled exhibits a 
much higher variability compared to the case with AS rela- 
tionships ignored. This difference in variability is likely to 
affect other performance metrics like jitter and router buffer 
occupancy. 

² Typically routers take much less than 100 seconds to 
converge, but to be conservative we used a longer period. 

Figure 2: CDF of e2e delay between traffic sources 
and destination. 

Table 2: Average bandwidth per flow (Kbps) with AS 
relationships enabled or disabled. 

Flow                         4→2    6→2    8→2 
AS relationships disabled    202    196    397 
AS relationships enabled     113    164    121 

Another consequence of policy-constrained routing is that 
ASs have fewer alternative AS paths. For example, in 
Figure 1, when ignoring AS relationships, AS 7 has three (one 
through each neighbor) disjoint paths to reach destination 2. 
On the other hand, with AS relationships enabled, AS 7 
has only one possible path through AS 5, since the other 
two paths are not valid. In Table 1, we show the total num- 
ber of paths we found in the BGP tables of the eight ASs 
in our simulations. The consistent decrease in the number 
of paths when AS relationships are enabled highlights that 
ignoring AS relationships increases the path diversity of the 
ASs in a simulation. Path diversity is an important prop- 
erty related to network robustness, vulnerability to attacks, 
link and router failures, load balancing, multi-path routing, 
convergence of routing protocols, and others. 

Yet another effect of policy routing is a different distribution 
of load on ASs and AS links. Indeed, due to the smaller 
number of available AS paths, compared to shortest path 
routing, some nodes and links are likely to experience greater 
traffic load. For example, in Figure 1 the dashed paths share 
the links from AS 7 to AS 5. On the other hand, when as- 
suming shortest path routing the three paths are mostly 
disjoint: only one link, the link between AS 3 and AS 2, is 
shared by two flows. Thus, AS links and nodes will receive 
greater load, compared to the case with AS relationships 
ignored. Higher load is likely to produce more packet loss, 
increased delay, congestion, router failures, and other unde- 
sirable effects. In Table 2 we list the average bandwidth in 
our simulations for each of the three flows with and without 
AS relationships enabled. We find that because of the in- 
creased load on the links between AS 7 and AS 5 the average 
bandwidth of the three flows decreases substantially. 

To summarize this section, we have provided three ex- 
amples showing that ignoring AS relationship annotations 
leads to inaccuracies, which make the corresponding prop- 
erties look "better" than they are in reality. Indeed, if AS 
relationships are ignored, then: 

• paths are shorter than in reality; 

• path diversity is larger than in reality; and 

• traffic load is lower than in reality. 

Figure 3: The 1K- and 2K-annotations. Three different 
stub colors are represented by dashed (color red), 
dash-dotted (color green), and dash-double-dotted 
(color blue) lines. 

4. NETWORK TOPOLOGY ANNOTATIONS 

In this section we first introduce our general formalism 
to annotate network topologies. We then show how this 
formalism applies to our example of the AS-level Internet 
topology annotated with AS relationships. 

4.1 General formalism 

Our general formalism is close to random colored graph 
definitions from [42] and borrows parts of the convenient 
dK-series terminology from [29]. 

We define the annotated network as a graph $G(V, E)$, 
$|V| = n$ and $|E| = m$, such that all $2m$ edge-ends (stubs) 
of all $m$ edges in $E$ are of one of several colors $c$, 
$c = 1 \ldots C$, where $C$ is the total number of stub colors. 
We also allow for node annotations by an independent set of 
node colors $\theta = 1 \ldots \Theta$. We do not use node 
annotations in this paper and, to keep the expressions below 
clear, we do not include them there. It is, however, trivial 
to add node annotations to these expressions. 

Compared to the non-annotated case, in which the node degree 
is fully specified by an integer value $k$ of the number of 
stubs attached to the node, we now have to list the numbers 
of attached stubs of each color to fully describe the node 
degree. Instead of scalar $k$, we thus have the node degree 
vector 

$$\mathbf{k} = (k_1, \ldots, k_C),$$

which has $C$ components $k_c$, each specifying the number 
of $c$-colored stubs attached to the node, cf. the left side of 
Figure 3. The $L^1$-norm of this vector yields the node degree 
with annotations ignored, 

$$k = |\mathbf{k}|_1 = \sum_{c=1}^{C} k_c. \qquad (1)$$

The number $n(\mathbf{k})$ of nodes of degree $\mathbf{k}$ 
defines the node degree distribution, 

$$P(\mathbf{k}) \cong n(\mathbf{k})/n, \qquad (2)$$

in the large-graph limit. We can think of $n(\mathbf{k})$ as a 
non-normalized form of $P(\mathbf{k})$. From the statistical 
perspective, the $n(\mathbf{k})$ ($P(\mathbf{k})$) distribution 
is a multivariate distribution. Its $C$ marginal distributions 
are the distributions of node degrees of each color $c$: 

$$n(k_c) = \sum_{\mathbf{k}' \,|\, k'_c = k_c} n(\mathbf{k}'), \qquad (3)$$

where the summation is over all vectors $\mathbf{k}'$ such that 
their $c$-th component is equal to $k_c$. The degree distribution 
$n(\mathbf{k})$ thus represents per-node correlations of degrees 
of different colors. Following the terminology in [29], we call 
the node degree distribution $n(\mathbf{k})$ ($P(\mathbf{k})$) 
the 1K-annotated distribution. 
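
To make these definitions concrete, the following sketch computes node degree vectors, the 1K-annotated counts n(k), and one marginal as in eq. (3) from an edge list. The edge representation `(u, color_u, v, color_v)`, the toy graph, and all function names are assumptions made for this illustration only.

```python
from collections import Counter


def degree_vectors(edges, num_colors):
    """Map each node to its degree vector (k_1, ..., k_C).

    Each edge is (u, color_u, v, color_v), where color_u is the color of the
    stub attached to u; colors are integers 1..num_colors.
    """
    vec = {}
    for u, cu, v, cv in edges:
        for node, color in ((u, cu), (v, cv)):
            vec.setdefault(node, [0] * num_colors)[color - 1] += 1
    return {node: tuple(k) for node, k in vec.items()}


def annotated_degree_distribution(vectors):
    """n(k): the number of nodes having each degree vector k."""
    return Counter(vectors.values())


def marginal(n_k, color):
    """n(k_c): sum n(k') over all vectors k' whose c-th component is k_c."""
    out = Counter()
    for k, count in n_k.items():
        out[k[color - 1]] += count
    return out


# Hypothetical toy graph with C = 3 stub colors.
edges = [("a", 1, "b", 2), ("c", 1, "b", 2), ("a", 3, "c", 3)]
vectors = degree_vectors(edges, num_colors=3)
n_k = annotated_degree_distribution(vectors)
print(vectors)           # per-node degree vectors
print(n_k)               # 1K-annotated counts n(k)
print(marginal(n_k, 1))  # marginal over the first color
```

Summing the components of any vector in `vectors` gives the ordinary node degree of eq. (1).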

We then define the 2K-annotated distribution as correlations 
of annotated degrees of connected nodes, or simply as 
the number of edges that have a stub of color $c$ connected to 
a node of degree $\mathbf{k}$ and the other stub of color $c'$ 
connected to a node of degree $\mathbf{k}'$, 
$n(c, \mathbf{k}; c', \mathbf{k}')$. See the right side of 
Figure 3 for illustration. 

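As in the non-annotated case, the 2K-distribution yields 
more exhaustive statistics about the annotated network 
topology and fully defines the 1K-distribution. To see that, 
we introduce the following notations: 

$$\tilde{\mathbf{k}} = (c, \mathbf{k}), \qquad \tilde{\mathbf{k}}' = (c', \mathbf{k}'),$$
$$\mu(c, c') = 1 + \delta(c, c'),$$
$$\mu(\mathbf{k}, \mathbf{k}') = 1 + \delta(\mathbf{k}, \mathbf{k}'),$$
$$\mu(\tilde{\mathbf{k}}, \tilde{\mathbf{k}}') = 1 + \delta(\tilde{\mathbf{k}}, \tilde{\mathbf{k}}'),$$

where $\delta(x, x')$ is the standard Kronecker delta: 

$$\delta(x, x') = \begin{cases} 1 & \text{if } x = x', \\ 0 & \text{otherwise}, \end{cases}$$

and $x$ is either $c$, $\mathbf{k}$, or $\tilde{\mathbf{k}}$. 
With these notations, one can easily check that the normalized 
2K-annotated distribution is 

$$P(\tilde{\mathbf{k}}, \tilde{\mathbf{k}}') = n(\tilde{\mathbf{k}}, \tilde{\mathbf{k}}')\, \mu(\tilde{\mathbf{k}}, \tilde{\mathbf{k}}') / (2m), \qquad (4)$$

the number of edges of any pair of colors connecting nodes 
of degrees $\mathbf{k}$ and $\mathbf{k}'$ is 

$$n(\mathbf{k}, \mathbf{k}') = \sum_{c, c'} n(\tilde{\mathbf{k}}, \tilde{\mathbf{k}}')\, \mu(\tilde{\mathbf{k}}, \tilde{\mathbf{k}}') / \mu(\mathbf{k}, \mathbf{k}'),$$

the normalized form of this distribution is 

$$P(\mathbf{k}, \mathbf{k}') = n(\mathbf{k}, \mathbf{k}')\, \mu(\mathbf{k}, \mathbf{k}') / (2m),$$

and the 1K-distribution is given by 

$$n(\mathbf{k}) = \sum_{\mathbf{k}'} n(\mathbf{k}, \mathbf{k}')\, \mu(\mathbf{k}, \mathbf{k}') / k, \qquad (5)$$
$$P(\mathbf{k}) = \frac{\bar{k}}{k} \sum_{\mathbf{k}'} P(\mathbf{k}, \mathbf{k}'), \qquad (6)$$

where $\bar{k} = 2m/n$ is the average degree. The last two 
expressions show how one can find the 1K-annotated distribution 
given the 2K-annotated distribution, and they look exactly 
the same as in the non-annotated case [29], except that we 
have vectors $\mathbf{k}$, $\mathbf{k}'$ instead of scalars 
$k$, $k'$. 

The dK-annotated distributions with $d > 2$ [29] can be 
defined in a similar way. 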
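
The statement that the 2K-annotated counts determine the 1K-annotated ones can be checked numerically. The sketch below is a hedged illustration with a hypothetical edge format `(u, color_u, v, color_v)` and hypothetical function names. It stores each edge once under an unordered pair of annotated stubs, then recovers n(k) by counting the stubs attached to nodes of degree vector k (an edge joining two nodes of the same degree vector contributes twice, which is the role of the factor μ) and dividing by the total degree |k|.

```python
from collections import Counter


def degree_vectors(edges, num_colors):
    """Map each node to its degree vector (k_1, ..., k_C)."""
    vec = {}
    for u, cu, v, cv in edges:
        for node, color in ((u, cu), (v, cv)):
            vec.setdefault(node, [0] * num_colors)[color - 1] += 1
    return {node: tuple(k) for node, k in vec.items()}


def two_k_annotated(edges, vectors):
    """Edge counts n(c, k; c', k'), one entry per unordered stub pair."""
    n2 = Counter()
    for u, cu, v, cv in edges:
        key = tuple(sorted(((cu, vectors[u]), (cv, vectors[v]))))
        n2[key] += 1
    return n2


def one_k_from_two_k(n2):
    """Recover n(k) from the 2K-annotated counts, in the spirit of eq. (5)."""
    stubs = Counter()  # stubs attached to nodes of degree vector k
    for ((_, k), (_, k2)), count in n2.items():
        stubs[k] += count
        stubs[k2] += count  # same-vector edges are counted twice (mu = 2)
    # Each node of degree vector k owns |k|_1 stubs, so the division is exact.
    return {k: s // sum(k) for k, s in stubs.items()}


# Hypothetical toy graph with C = 3 stub colors.
edges = [("a", 1, "b", 2), ("c", 1, "b", 2), ("a", 3, "c", 3)]
vectors = degree_vectors(edges, num_colors=3)
n2 = two_k_annotated(edges, vectors)
recovered = one_k_from_two_k(n2)
direct = dict(Counter(vectors.values()))
print(recovered == direct)
```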



4.2 The AS relationship annotations 

In the specific case of the AS-level Internet topology that 
interests us in this paper, we have just three colors: customer, 
provider, and peer. We assign the following numeric 
values to represent these three colors: 

$$c = \begin{cases} 1 & \text{customer}, \\ 2 & \text{provider}, \\ 3 & \text{peer}. \end{cases}$$

These three stub annotations come under the following two 
constraints defining the only two types of edges that we have:³ 

1) c2p edges: if one stub of an edge is customer, then the 
other stub of the same edge is provider, and vice versa; and 

2) p2p edges: if one stub of an edge is peer, then the other 
stub of the same edge is also peer. 

The c2p edges are thus asymmetric, i.e., a generalization of 
directed edges, while the p2p edges are symmetric, i.e., a 
generalization of bi-directed or undirected edges. 

While the 2K-annotated distribution 
$n(\tilde{\mathbf{k}}, \tilde{\mathbf{k}}')$ contains the 
most exhaustive information about the network topology, it 
has too many (seven) independent arguments. As a result, 
the full 2K-annotated statistics is extremely sparse, which 
makes it difficult to model and reproduce directly. We thus 
have to find some summary statistics of 
$n(\tilde{\mathbf{k}}, \tilde{\mathbf{k}}')$ that we can 
model in practice. For each concrete complex network type, 
these summary statistics might be different. Given measurement 
data for a specific complex network, one would 
usually have to start with identifying a meaningful set of 
summary statistics of the 2K-annotated distribution, and 
then proceed from there. At the same time, we believe that 
as long as the 2K-annotated distribution fully defines an 
observed complex network, i.e., the network is 
2K-annotated-random [29], one can generally use the set of 
summary statistics that we found necessary and sufficient to 
reproduce in order to model correctly the Internet AS topology. 
In the rest of this section, we list these statistics and 
describe the specific meanings that they have in the AS 
topology case. 

Degree distribution (DD). This statistic is the traditional 
non-annotated degree distribution $n(k)$, where $k$ is 
as in eq. (1). The DD tells us how many ASs of each total 
degree $k$ are in the network. 

Annotation distributions (ADs). The DD of an AS 
topology does not convey any information about the AS 
relationships. The initial step to account for this information 
is to reproduce the distributions of ASs with specific numbers 
of attached customer, provider, or peer stubs. These 
annotation distributions (ADs) are the marginal distributions 
$n(k_c)$, $c = 1, 2, 3$, of the 1K-annotated distribution. 
They are given by eq. (3). If $k_1$ ($k_2$) customer (provider) 
stubs attach to an AS, then this AS has exactly $k_1$ ($k_2$) 
providers (customers), since the c2p edges are asymmetric. 
Consequently, the ADs $n(k_1)$ and $n(k_2)$ tell us how many 
ASs with the specific numbers of providers and, respectively, 
customers the network has. Since the p2p edges are symmetric, 
the AD $n(k_3)$ is the distribution of ASs with specific 
numbers of peers. 

Annotated degree distribution (ADD). The ADs do 
not tell us anything about the correlations among annotated 
degrees of the same node, i.e., how many customers, 
providers, and, simultaneously, peers a specific AS has. 
Correlations of this type are fully described by the 1K-annotated 
distribution in eq. (2), which we also call the annotated 
degree distribution (ADD). These correlations are present in 
the Internet. For example, large tier-1 ISPs typically have a 
large number of customers, i.e., large $k_2$, no providers, i.e., 
zero $k_1$, and a small number of peers, i.e., small $k_3$. On the 
other hand, medium-size ISPs tend to have a small set of 
customers, several peers, and few providers. Ignoring the 
ADD while generating synthetic graphs can lead to artifacts 
like high-degree nodes with many providers, a property 
obviously absent in the real Internet. 

³ We ignore sibling relationships, since they typically account 
for a very small fraction of the total number of edges. As 
found in [13], the number of s2s edges is only 0.46% of the 
total number of edges in the AS-level Internet. 

Joint degree distributions (JDDs). While the ADD 
contains the full information about degree correlations "at 
nodes," it does not tell us anything about degree correlations 
"across links," while the latter type of correlations 
is also characteristic for the Internet. For example, large 
tier-1 ISPs typically have p2p relationships with other tier-1 
ISPs, not with much smaller ISPs, while small ISPs have 
p2p links with other small ISPs. In other words, p2p links 
usually connect ASs of similar degrees, i.e., $k \sim k'$. 
Similarly, c2p links tend to connect low-degree customers to 
high-degree providers, i.e., $k \ll k'$. If we ignore these 
correlations, we can synthesize graphs with inaccuracies like 
p2p links connecting ASs of drastically dissimilar degrees. 
To reproduce these correlations, we work with the following 
summary statistics of the 2K-annotated distribution in 
eq. (4): 

$$n_{c2p}(k, k') = \sum_{\mathbf{k}, \mathbf{k}' \,|\, |\mathbf{k}|_1 = k,\ |\mathbf{k}'|_1 = k'} n(1, \mathbf{k}; 2, \mathbf{k}'), \qquad (7)$$

$$n_{p2p}(k, k') = \sum_{\mathbf{k}, \mathbf{k}' \,|\, |\mathbf{k}|_1 = k,\ |\mathbf{k}'|_1 = k'} n(3, \mathbf{k}; 3, \mathbf{k}'), \qquad (8)$$

where the summation is over such vectors $\mathbf{k}$ and 
$\mathbf{k}'$ that their $L^1$-norms are $k$ and $k'$ 
respectively. The first expression gives the number of c2p 
links that have their customer stub attached to a node of 
total degree $k$ and provider stub attached to a node of total 
degree $k'$. The second expression is the number of p2p links 
between nodes of total degrees $k$ and $k'$. In other words, 
these two objects are the joint 
degree distributions (JDDs) for the c2p and p2p links. 
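
A minimal sketch of how the two JDDs in eqs. (7) and (8) could be tallied from an annotated edge list follows. The edge format `(u, color_u, v, color_v)`, the toy graph, and the function names are assumptions for illustration. Because a p2p edge is symmetric, the sketch stores it once under a sorted degree pair.

```python
from collections import Counter

CUSTOMER, PROVIDER, PEER = 1, 2, 3  # stub colors as numbered in Section 4.2


def total_degrees(edges):
    """Total (non-annotated) degree k of every node."""
    deg = Counter()
    for u, _, v, _ in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def joint_degree_distributions(edges):
    """Return (n_c2p, n_p2p) keyed by pairs of total degrees (k, k')."""
    deg = total_degrees(edges)
    n_c2p, n_p2p = Counter(), Counter()
    for u, cu, v, cv in edges:
        if (cu, cv) == (CUSTOMER, PROVIDER):
            n_c2p[(deg[u], deg[v])] += 1  # key is (customer k, provider k')
        elif (cu, cv) == (PROVIDER, CUSTOMER):
            n_c2p[(deg[v], deg[u])] += 1
        elif (cu, cv) == (PEER, PEER):
            n_p2p[tuple(sorted((deg[u], deg[v])))] += 1
        else:
            raise ValueError(f"invalid stub color pair: {(cu, cv)}")
    return n_c2p, n_p2p


# Hypothetical toy topology: b is a provider of a, c, and d; a and c peer.
edges = [
    ("a", CUSTOMER, "b", PROVIDER),
    ("c", CUSTOMER, "b", PROVIDER),
    ("d", CUSTOMER, "b", PROVIDER),
    ("a", PEER, "c", PEER),
]
n_c2p, n_p2p = joint_degree_distributions(edges)
print(n_c2p)
print(n_p2p)
```

The `ValueError` branch enforces the two edge constraints of Section 4.2: a customer stub must face a provider stub, and a peer stub must face a peer stub.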

In summary, we work with four types of distributions, 
i.e., DD, ADs, ADD, and JDDs, that allow two types of 
classification: 

1. Univariate vs. multivariate distributions: 

(a) Univariate. The ADs and DD are distributions of 
only one random variable. 

(b) Multivariate. The ADD and JDDs are joint distributions 
of three and two random variables, respectively. The 
marginal distributions of these variables are the 
ADs, cf. eq. (3), and DD, cf. eqs. (7,8), respectively. 

2. 1K- vs. 2K-summary statistics: 

(a) 1K-derived. The DD, ADs, and ADD are fully defined 
by the 1K-annotated distribution: that is, 
we do not need to know the 2K-annotated distribution 
to calculate the distributions in this class. 

(b) 2K-derived. The JDDs are fully defined only by 
the 2K-annotated distribution. Note that it also 
defines the 1K-annotated distribution via eqs. (5,6). 

5. GENERATING ANNOTATED AS GRAPHS 

In this section we describe how we generate synthetic 
annotated AS graphs of arbitrary sizes. We want our synthetic 
graphs to reproduce as many important properties of the 
original measured topology as possible. For this purpose 
we decide to explicitly model and reproduce the summary 
statistics of the 2K-annotated distribution from Section 4.2, 
because [29] showed that by reproducing 2K-distributions, 
one automatically captures a long list of other important 
properties of AS topologies. In other words, the task of 
generating synthetic annotated topologies becomes equivalent 
to the task of generating random annotated graphs 
that reproduce the summary statistics of the 2K-annotated 
distribution of the measured AS topology. 

We wish to be able to generate synthetic topologies of 
different sizes, but the 2K-summary statistics defined in 
Section 4.2 are all bound to a specific graph size. Therefore, 
in order to generate arbitrarily-sized graphs, we need 
first to rescale the 2K-summary statistics from the original 
to target graph sizes. We say that an empirical distribution 
is rescaled with respect to another empirical distribution 
if both distributions are defined by two different 
finite collections of random numbers drawn from the same 
continuous distribution. For example, the distributions of 
node scalar (or vector) degrees in two different graphs are 
rescaled with respect to each other if these degrees are drawn 
from the same continuous univariate (or multivariate) 
probability distribution. We say that a 2K-annotated graph is 
rescaled with respect to another 2K-annotated graph if all 
the 2K-summary statistics of the first graph are rescaled 
with respect to the corresponding 2K-summary statistics of 
the second graph. This definition of rescaling is equivalent 
to assuming that for each summary statistic, the same 
distribution function describes the ensemble of empirical 
distributions of the statistic in past, present, and future 
Internet topologies. In other words, we assume that the 
2K-annotated correlation profile of the Internet AS topology is 
an invariant of its evolution. This assumption is realistic, as 
discussed, for example, in [36], where it is shown that the 
non-annotated 1K- and 2K-distributions of the Internet AS 
topology have stayed approximately the same during all the 
years (more than a decade) of the existing data time span. 

To illustrate what we mean by rescaling, consider the empirical
distribution of peer degrees, i.e., the AD $n(k_3)$, in the measured AS
topology annotated with AS relationships in Figure 4(a). The figure
shows the empirical complementary cumulative distribution function
(CCDF) for the peer degrees of 19,036 nodes, i.e., 19,036 numbers of
peer stubs attached to a node, the largest of which is 448. The
continuous probability distribution of Figure 4(b) approximates the
empirical distribution in Figure 4(a). Figures 4(c), 4(d), and 4(e)
show the CCDFs of three collections of 5,000, 20,000, and 50,000 random
numbers drawn from the probability distribution in Figure 4(b).
According to our definition of rescaling, the distributions in
Figures 4(c), 4(d), and 4(e) are rescaled with respect to the
distribution of Figure 4(a). We see that all the empirical
distributions have the same overall shape, but differ in the total
number of samples and in the maximum values within these sample
collections. Distributions with larger maxima correspond, as expected,
to bigger collections of samples.
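The notion of rescaling can be illustrated with a minimal Python
sketch. It assumes, purely for illustration, that the underlying
continuous distribution is a Pareto law with a hypothetical exponent
`alpha`; the approach described here instead uses a fit of the measured
data. Collections of different sizes drawn from the same law are
rescaled with respect to each other in the sense defined above.

```python
import random

def draw_degrees(n, alpha, rng):
    """Draw n integer 'degrees' from one continuous heavy-tailed law.

    random.paretovariate(alpha) returns floats >= 1; int() floors them.
    The Pareto law and alpha are illustrative assumptions, not the
    distribution fitted to the measured topology.
    """
    return [int(rng.paretovariate(alpha)) for _ in range(n)]

rng = random.Random(1)
for size in (5000, 20000, 50000):
    sample = draw_degrees(size, alpha=1.1, rng=rng)
    # Same law, different sizes: larger collections tend to reach
    # larger maxima, although this is not guaranteed for any one draw.
    print(size, max(sample))
```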

5.1 Overview of the approach 

We now move to describing the details of our approach, 
which consists of the following three major phases: 

1. Extraction. 

We first extract the empirical 2K-summary distributions from available
AS topology measurement data. We annotate links of the AS topology
extracted from this data using existing AS relationship inference
heuristics. This extraction step is conceptually the simplest. At its
output, we obtain the extracted 2K-summary distributions, which are all
bound to the size of the measured AS graph.

2. Rescaling. 

(a) We use the extracted empirical distributions to 
find their continuous approximations. Referring 
to our example in Figure 4, this step corresponds 
to computing the continuous probability distribu- 
tion in Figure 4(b) based on the empirical distri- 
bution in Figure 4(a). 

(b) We then use the computed probability distribu- 
tions to rescale the empirical distributions ob- 
tained at the extraction step. We generate a 
desired, target number of random scalar or vec- 
tor degree samples drawn from the correspond- 
ing probability distributions. The generated de- 
gree samples have empirical distributions that are 
rescaled with respect to the corresponding empir- 
ical distributions of the measured topology. Re- 
ferring to our example in Figure 4, this step cor- 
responds to generating the rescaled empirical dis- 
tributions in Figures 4(c), 4(d), and 4(e) based 
on the probability distribution in Figure 4(b). 

3. Construction. 

Finally, we develop algorithms to generate synthetic graphs that have
their 2K-summary distributions equal to given distributions, i.e., to
the corresponding distributions obtained at the previous step. The
generated graphs thus reproduce the rescaled replicas of the
2K-annotated distribution of the original topology, but they are
"maximally random" in all other respects.

In the rest of this section, we describe each of these phases 
in detail. 

5.2 Extraction 

We extract the AS topology from the RouteViews [38] data, performing
some standard data cleaning, such as ignoring private AS numbers, AS
sets, etc. [30]. The resulting AS graph is initially non-annotated. To
annotate it, we infer c2p and p2p relationships for AS links using the
heuristics in [13]. We thus obtain the real Internet AS topology
annotated with c2p and p2p relationships. Given this annotated
topology, we straightforwardly calculate all the empirical 2K-summary
distributions that we have defined in Section 4.2.

While the extraction phase is conceptually and technically the simplest
phase of the overall approach, it is its basis. Therefore the quality
of the input Internet topology data is a natural concern. This data is
known to exhibit a variety of vagaries, e.g., due to sampling biases
[27, 10, 11]. However, our approach is oblivious to data quality. It
takes any available data, extracts the described statistics from it,
and reproduces them, properly rescaled, in random synthetic graphs. A
given input topology data set thus defines an ensemble of random graphs
generated by our method. By construction, all graphs in this ensemble
reproduce the described set of annotated distributions. In addition, in
Section 6, we perform sensitivity analysis in order to see the strength
of fluctuations of these and other basic graph metrics within an
ensemble. The quality of these graph ensembles, in terms of how
veraciously they reflect reality, will improve as the quality of
available topology data improves in the future. In this paper, we
simply illustrate our approach with the currently available topology
data. RouteViews [38] is just one of very few sources of such data
[30]. We select it because it appears to be the most frequently cited
Internet topology data source.

5.3 Rescaling 

Our rescaling approach differs for univariate and multi- 
variate distributions. 

5.3.1 Rescaling univariate distributions 

We recall from the end of Section 4.2 that we have the following two
types of univariate distributions: the ADs and the DD. Here we describe
how we rescale ADs. We note that we do not have to rescale the DD the
same way. The reason is that our approach to rescaling the ADD, which
we discuss below in Section 5.3.2, automatically takes care of
rescaling the DD, since the ADD is the distribution of degree vectors
and the DD is the distribution of the $L^1$-norms of these vectors, cf.
eq. (1).

The first problem we face trying to compute a continu- 
ous approximation for a given finite empirical distribution 
is that we have to not only interpolate between points of the 
empirical distribution, but also extrapolate above its maxi- 
mum value. For example, if we want to construct a synthetic 
graph bigger than the original, then we expect its maximum 
degree to be larger than the maximum degree in the orig- 
inal graph. Therefore we have to properly extrapolate the 
observed degree distribution beyond the observed maximum 
degree. 

We solve this problem by fitting the univariate empiri- 
cal distributions with smoothing splines. Spline smoothing 
is a non-parametric estimator of an unknown function rep- 
resented by a collection of empirical data points. Spline 
smoothing produces a smooth curve passing through or near 
the data points. For example, the curve in Figure 4(b) is a 
smooth spline of the empirical distribution of Figure 4(a). 
Spline smoothing can also extrapolate the shape of an em- 
pirical function beyond the original data range. 

Another reason to select spline smoothing is that it is useful for
fitting distributions that do not closely follow regular shapes, e.g.,
"clean" power laws. The ADs of the Internet topology do not necessarily
have such regular shapes. For example, the distribution of the number
of peers, i.e., the AD $n(k_3)$, has a complex shape that we found
impossible to fit with any single-parameter distribution.

Among available implementations of spline smoothing techniques, we
select the one in the smooth.spline method of the R project [37], a
popular statistical computing package. The specific details of this
technique are in [7].

Figure 4: Rescaling an empirical distribution. The distributions in the
three bottom figures are rescaled with respect to the empirical
distribution in the first figure. The distribution in the second figure
is a continuous approximation of the distribution in the first figure
and is used to generate the rescaled distributions in the bottom
figures. For each discrete distribution, we show its maximum in the
top-right corner of the plot.

We can approximate with splines either the CDFs or CCDFs (see
footnote 4) of the ADs obtained at the extraction step. We chose to fit
the CCDFs rather than the CDFs because the former better capture the
shapes of the high-degree tails of our heavy-tailed ADs.

Another important detail is that we can define an em- 
pirical CCDF to be either a left- or right-continuous step 
function [24]. Usually an empirical CCDF at some point x 
is defined as the fraction of samples with values strictly 
larger than x, which means that the distribution is right- 
continuous and that the probability of a value larger than 
the observed maximum value is zero, whereas the proba- 
bility of a value smaller than the observed minimum value 
is unspecified. For degree distributions, we know that the
probability of a degree smaller than zero is zero (see footnote 5), but
we do not know the probability of a degree larger than the maximum
observed degree. For this reason, we decide to fit the left-continuous
variants of empirical CCDFs, i.e., we define a CCDF at some point x as
the fraction of samples with values greater than or equal to x.
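The difference between the two conventions can be made concrete with a
small Python sketch. The function names and the toy sample are
illustrative assumptions; only the two definitions come from the text
above.

```python
def ccdf_left(samples, x):
    """Left-continuous empirical CCDF: fraction of samples >= x."""
    return sum(1 for s in samples if s >= x) / len(samples)

def ccdf_right(samples, x):
    """Right-continuous empirical CCDF: fraction of samples > x."""
    return sum(1 for s in samples if s > x) / len(samples)

degrees = [1, 2, 2, 5]          # toy degree sample, hypothetical
print(ccdf_left(degrees, 5))    # 0.25: the maximum keeps non-zero mass
print(ccdf_right(degrees, 5))   # 0.0: nothing is strictly larger than 5
print(ccdf_left(degrees, 1))    # 1.0: every degree is >= the minimum
```

With the left-continuous variant the curve is still positive at the
observed maximum, which gives a fit something to extrapolate from.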

Having the original ADs fitted with splines and assuming that our
target graph size is N, we finally use the standard inverse-CDF method
to produce N random numbers that follow the continuous distributions
given by the splines. Recall that the inverse-CDF method is based on
the observation that if the CDF of N random numbers $x_j$,
$j = 1 \ldots N$, closely follows some function $F(x)$, then the
distribution of numbers $y_j = F(x_j)$ is approximately uniform in the
interval [0, 1]. As its name suggests, the inverse-CDF method inverts
this observation and operates as follows [25]: given a target CDF
$F(x)$ and a target size N of a collection of random samples, the
method first generates N random numbers $y_j$ uniformly distributed in
[0, 1] and then outputs numbers $x_j = F^{-1}(y_j)$, where $F^{-1}(y)$
is the inverse of CDF $F(x)$, i.e., $F^{-1}(F(x)) = x$. The CDF of
numbers $x_j$ closely follows $F(x)$. Figures 4(c), 4(d), and 4(e) show
random numbers generated this way. These numbers follow the
distribution in Figure 4(b). To compute values of inverse CDFs on
N random numbers uniformly distributed in [0, 1] in our case, we use
the predict.smooth.spline method of the R project. Since the random
numbers produced in this way are not, in general, integers, we convert
them to integer degree values using the floor function. We have to use
the floor and not the ceiling function because we work with left-
rather than right-continuous distributions.
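The inverse-CDF step with flooring can be sketched in Python as follows.
This is a minimal sketch under stated assumptions: a piecewise-linear
CDF through hypothetical knots `xs`, `fs` stands in for the smoothing
spline, and the helper names are not from the paper. With
$F(x) = P(X < x)$, which is one minus the left-continuous CCDF, the
floor of a continuous sample equals k with probability
$F(k+1) - F(k)$, i.e., the probability of a degree exactly k, which is
why the floor is the matching choice.

```python
import bisect
import math
import random

def inverse_cdf(y, xs, fs):
    """Invert a piecewise-linear CDF given by knots (xs[i], fs[i]).

    xs must be increasing and fs non-decreasing from 0 to 1.
    """
    i = bisect.bisect_right(fs, y) - 1
    i = max(0, min(i, len(xs) - 2))       # clamp to a valid segment
    x0, x1, f0, f1 = xs[i], xs[i + 1], fs[i], fs[i + 1]
    if f1 == f0:                          # flat segment: no mass here
        return x0
    return x0 + (y - f0) * (x1 - x0) / (f1 - f0)

def sample_degrees(n, xs, fs, rng):
    """Draw n uniform numbers, invert the CDF, and floor to integers."""
    return [math.floor(inverse_cdf(rng.random(), xs, fs)) for _ in range(n)]

xs = [1, 2, 4, 8]                  # hypothetical knots of the fitted CDF
fs = [0.0, 0.5, 0.75, 1.0]
print(inverse_cdf(0.25, xs, fs))   # 1.5
print(sample_degrees(5, xs, fs, random.Random(0)))
```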

The outcome of the described process is three sets of N random numbers
that represent N customer degrees $d_1^j$, N provider degrees $d_2^j$,
and N peer degrees $d_3^j$ of nodes in the target graph,
$j = 1 \ldots N$. We denote the CDFs of these random numbers by
$D_1(d_1)$, $D_2(d_2)$, and $D_3(d_3)$, respectively. By construction,
these distributions are properly rescaled versions of the customer-,
provider-, and peer-annotation distributions (ADs) in the measured AS
topology.

5.3.2 Rescaling multivariate distributions

Rescaling multivariate distributions is not as simple as rescaling
univariate distributions. Our approach for rescaling univariate
distributions is not practically applicable to rescaling multivariate
distributions because it is difficult to fit distributions that have
many variates and complex shapes. To rescale multivariate
distributions, we use copulas [33], which are a statistical tool for
quantifying correlations between several random variables. Compared to
other well-known correlation metrics, such as Pearson's coefficient,
copulas give not a single scalar value but a function of several
arguments that fully describes complex, fine-grained details of the
structure of correlations among the variables, i.e., their correlation
profile.

4 Recall that the CCDF of CDF $F(x)$ is $1 - F(x)$.

5 For a given node, some but not all the degrees $k_1$, $k_2$, and
$k_3$ can be equal to zero.

According to Sklar's theorem [39], any p-dimensional multivariate CDF F
of p random variables $\mathbf{k} = (k_1, \ldots, k_p)$ can be written
in the following form:

$$F(\mathbf{k}) = H(\mathbf{u}), \qquad (9)$$

where $\mathbf{u}$ is the p-dimensional vector composed of F's marginal
CDFs $F_m(k_m)$, $m = 1, \ldots, p$:

$$F_m(k_m) = F(\infty, \ldots, \infty, k_m, \infty, \ldots, \infty), \qquad (10)$$

$$\mathbf{u} = (F_1(k_1), \ldots, F_p(k_p)). \qquad (11)$$

The function H is called a copula and each of its marginal
distributions is uniform in [0, 1].
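As a worked illustration of eqs. (9)-(11), consider the simplest
possible case, which is a textbook example and not taken from the
measured data: p = 2 independent variables.

```latex
% Two independent variables: the joint CDF factorizes,
F(k_1, k_2) = F_1(k_1)\, F_2(k_2) ,
% so with u_m = F_m(k_m), eq. (9) gives the independence copula
H(u_1, u_2) = u_1 u_2 ,
% whose marginals are uniform on [0, 1], as required:
H(u_1, 1) = u_1 , \qquad H(1, u_2) = u_2 .
```

Any dependence among the variables shows up as a deviation of H from
this product form, while the marginals $F_m$ are unaffected.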

Copulas play a critical role at the following two steps in our approach
for rescaling multivariate distributions. First, they allow us to split
a multivariate distribution of the original, measured topology into two
parts: the first part consists of the marginal distributions $F_m$,
while the second part is their correlation profile, i.e., copula H.
These two parts are independent. Therefore, we can independently
rescale the marginal distributions and the correlation profile. This
property tremendously simplifies the rescaling process. The marginals
are univariate distributions that we rescale as in Section 5.3.1, while
this section contains the details of how we rescale the correlation
profile. Second, we use copulas to merge together the rescaled
marginals and their correlation profile to yield a rescaled
multivariate distribution in its final form.

In Figure 5 we present a high-level overview of our ap- 
proaches for rescaling univariate and multivariate distribu- 
tions. To rescale an original empirical univariate distribu- 
tion, we first approximate it with splines and then use these 
splines to generate random numbers. We split the process 
of rescaling an original empirical multivariate distribution 
into two independent rescaling sub-processes, i.e., rescaling 
the marginals and their copula. We rescale the marginals 
as any other univariate distributions. To rescale the cop- 
ula, we re-sample measured correlation data as we describe 
below in this section. At the end of multivariate rescaling, 
we merge the rescaled marginals with the rescaled copula 
to yield a rescaled multivariate distribution. One can see 
from Figure 5 that multivariate rescaling is a "superset," in 
terms of actions involved, of univariate rescaling. The fol- 
lowing three steps summarize the high-level description of 
our multivariate rescaling approach: 

1. extract and rescale the univariate marginals of a multivariate
distribution as described in Section 5.3.1 (boxes (a), (b), and (c) in
Figure 5);

2. extract and rescale the copula of the multivariate dis- 
tribution (boxes (d) and (e) in Figure 5); and 

3. merge the rescaled marginals and copula yielding a 
rescaled multivariate distribution (box (f) in Figure 5). 



In the rest of this section, we provide the low-level details for 
the last two steps, using the ADD multivariate distribution 
as an example. 

At Step 2, we compute a rescaled ADD copula as follows. The collected
AS topology has n nodes, and for each node i, $i = 1 \ldots n$, we
record its degree vector $\mathbf{k}_i = (k_1^i, k_2^i, k_3^i)$,
producing an n-sized set of degree triplets. We then perform
statistical simulation on this set to produce another set of a desired
size that has the same correlations as the measured data
$\mathbf{k}_i$. Specifically, we re-sample, uniformly at random and
with replacement, N degree triplets from the set of vectors
$\mathbf{k}_i$, where N is the target size of our synthetic topology.
We thus obtain an N-sized set of random triplets $\mathbf{k}_j$,
$j = 1 \ldots N$, and we denote their joint CDF by $F(\mathbf{k})$. By
construction, the empirical distribution of triplets $\mathbf{k}_j$ has
the same correlation profile as the original triplets $\mathbf{k}_i$.
This procedure corresponds to box (d) in Figure 5.

Next, see box (e) in Figure 5, we compute the empirical copula of
distribution $F(\mathbf{k})$. By definition, the copula of
$F(\mathbf{k})$ is simply the joint distribution of vectors
$\mathbf{u}$ in eqs. (9,11). Therefore, we first compute the marginal
CDFs $F_1(k_1)$, $F_2(k_2)$, and $F_3(k_3)$ as CDFs of the first,
second, and third components of vectors $\mathbf{k}_j$:

$$u_m^j = F_m(k_m^j) = r_m^j / N, \quad m = 1, 2, 3, \qquad (12)$$

where $r_m^j$ is the rank (position number) of value $k_m^j$ in the
N-sized list of values $k_m$ sorted in non-decreasing order. Random
triplets $\mathbf{u}_j = (F_1(k_1^j), F_2(k_2^j), F_3(k_3^j))$ take
values in the cube $[0, 1]^3$, with each component uniformly
distributed in [0, 1], and their joint CDF $H(\mathbf{u})$,
$\mathbf{u} = (F_1(k_1), F_2(k_2), F_3(k_3))$, is the empirical copula
for distribution $F(\mathbf{k})$, cf. eq. (9), that describes the
correlations among $k_1$, $k_2$, and $k_3$.

At Step 3, box (f) in Figure 5, we merge the rescaled marginals
$D_m(d_m)$, $m = 1, 2, 3$, from Section 5.3.1 and copula
$H(\mathbf{u})$ by computing the target graph degree triplets
$\mathbf{q}_j = (q_1^j, q_2^j, q_3^j)$, $j = 1, \ldots, N$, as

$$q_m^j = D_m^{-1}(u_m^j), \qquad (13)$$

where $D_m^{-1}$ are the inverse CDFs of $D_m$ from Section 5.3.1. By
construction, the correlation profile of annotation-degree vectors
$\mathbf{q}_j$ is the same as that of the ADD in the original topology,
while the distributions of their components $q_m$ are rescaled ADs.

Algorithm 1 lists the described low-level details of our 
multivariate rescaling, using the ADD as an example. 

We conclude our discussion of rescaling with the follow- 
ing remark. Recall from the end of Section 4.2 that we have 
the following two types of multivariate statistics: the ADD 
and the JDDs. As illustrated in Figure 5, we rescale the 
ADD using all the three steps described in this section. For 
rescaling a JDD, it is not necessary to separately rescale its 
marginals, i.e., to use the first step of the described rescal- 
ing process, since the marginals of JDDs are distributions 
of scalar degrees that we automatically rescale during the 
ADD rescaling. To rescale a JDD, we execute only the sec- 
ond step of the described rescaling process to obtain the 
rescaled empirical JDD copula. We then use this copula to 
determine proper placement of edges in the final synthetic 
graph that we construct. In other words, the last, third 
step of our multivariate rescaling process applied to JDDs 
takes place during the graph construction phase, which we 
describe next. 



Algorithm 1: Rescaling ADD

Input: Degree vectors $\mathbf{k}_i = (k_1^i, k_2^i, k_3^i)$, $i = 1 \ldots n$, of the measured topology;
Input: Size N of the target synthetic topology.

// Step 1: AD rescaling
forall m = 1, 2, 3 do
    Let $k_m$ be the list of the m-th component values of vectors $\mathbf{k}_i$;
    Approximate distribution $k_m$ by a smoothing spline $S_m$;
    Sample N numbers $d_m^j$, $j = 1 \ldots N$, with probability distribution given by $S_m$;
    Let $D_m(d_m)$ be the CDF of $d_m$.
end

// Step 2: copula rescaling
Re-sample N degree triplets $\mathbf{k}_j$ from the set of $\mathbf{k}_i$;
forall m = 1, 2, 3 do
    Let $k_m$ be the list of the m-th component values of vectors $\mathbf{k}_j$;
    Sort list $k_m$ in the non-decreasing order of values;
    forall j = 1 ... N do
        Let $r_m^j$ be the position number of value $k_m^j$ in the sorted list;
        $u_m^j = r_m^j / N$.
    end
end

// Step 3: merge rescaled ADs and the ADD copula
forall m = 1, 2, 3 do
    forall j = 1 ... N do
        $q_m^j = D_m^{-1}(u_m^j)$.
    end
end

Output: Degree vectors $\mathbf{q}_j = (q_1^j, q_2^j, q_3^j)$, $j = 1 \ldots N$, of the synthetic topology.
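The three steps of Algorithm 1 can be sketched in Python as follows.
This is a sketch under stated assumptions rather than the authors'
implementation: Step 1 is delegated to a hypothetical callback
`sample_marginal` (in the paper this role is played by sampling from
the spline fit), ties in the ranking are broken by list position, and
the inverse empirical CDF is evaluated by integer rank, which avoids
floating-point rounding of r/N.

```python
import random

def rescale_add(k, N, sample_marginal, rng):
    """Sketch of Algorithm 1 for degree vectors k (a list of tuples).

    sample_marginal(m, N, rng) is a hypothetical callback returning
    N rescaled degrees for component m (Step 1).
    """
    p = len(k[0])
    # Step 1: rescaled ADs. Sorting makes d[m][r - 1] the value of the
    # inverse empirical CDF D_m^{-1} at u = r / N.
    d = [sorted(sample_marginal(m, N, rng)) for m in range(p)]

    # Step 2: re-sample N vectors with replacement, rank each component.
    resampled = [rng.choice(k) for _ in range(N)]
    rank = [[0] * p for _ in range(N)]
    for m in range(p):
        order = sorted(range(N), key=lambda j: resampled[j][m])
        for r, j in enumerate(order, start=1):
            rank[j][m] = r      # u_m^j = r / N; ties broken by position

    # Step 3: q_m^j = D_m^{-1}(u_m^j), looked up by integer rank.
    return [tuple(d[m][rank[j][m] - 1] for m in range(p)) for j in range(N)]

def toy_marginal(m, N, rng):
    """Placeholder for sampling from the fitted distribution S_m."""
    return [rng.randint(0, 50) for _ in range(N)]

measured = [(0, 1, 0), (0, 2, 1), (5, 1, 3), (40, 0, 12)]   # toy data
print(rescale_add(measured, 6, toy_marginal, random.Random(0)))
```

By construction, each component of the output is a permutation of the
corresponding rescaled marginal sample, while the rank ordering across
components is inherited from the re-sampled measured vectors.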



5.4 Construction 

We describe the 1K- and 2K-annotated random graph constructors that are
both generalizations of the well-known configuration or pseudograph
approach in the terminology of [29]. The 1K-constructor requires only
the rescaled ADD, while the 2K-constructor also needs the rescaled JDD
copulas.

5.4.1 Constructing 1K-annotated random graphs

Using the rescaled degree vectors $\mathbf{q}_j$, $j = 1 \ldots N$, we
construct 1K-annotated random graphs using the following algorithm:

1. for each vector $\mathbf{q}_j = (q_1^j, q_2^j, q_3^j)$, prepare a
node with $q_1^j$ customer stubs, $q_2^j$ provider stubs, and $q_3^j$
peer stubs;

2. randomly select pairs of either customer-and-provider or
peer-and-peer stubs, and connect (match) them together to form c2p or
p2p links;

3. remove unmatched stubs, multiple edges between the same pair of
nodes (loops), and links with both ends connected to the same node
(self-loops), and extract the largest connected component.
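The three steps above can be sketched in Python as follows. This is a
minimal illustration, not the authors' code: the function name and the
edge representation are assumptions, and when the same node pair is
matched more than once, the sketch simply keeps the first link and its
annotation.

```python
import random
from collections import defaultdict

def build_1k_graph(q, rng):
    """q[j] = (customer stubs, provider stubs, peer stubs) of node j.

    Returns a list of (a, b, kind) edges; for kind 'c2p', a is the
    customer and b the provider.
    """
    cust, prov, peer = [], [], []
    for node, (qc, qp, qr) in enumerate(q):     # Step 1: prepare stubs
        cust += [node] * qc      # stubs that must lead to a customer
        prov += [node] * qp      # stubs that must lead to a provider
        peer += [node] * qr
    for stubs in (cust, prov, peer):
        rng.shuffle(stubs)

    # Step 2: match stubs; zip silently drops unmatched surplus stubs.
    raw = [(c, p, "c2p") for p, c in zip(cust, prov)]
    raw += [(a, b, "p2p") for a, b in zip(peer[0::2], peer[1::2])]

    # Step 3: drop self-loops and repeated node pairs (first one wins).
    edges, seen = [], set()
    for a, b, kind in raw:
        pair = frozenset((a, b))
        if a != b and pair not in seen:
            seen.add(pair)
            edges.append((a, b, kind))

    # Keep only the largest connected component.
    adj = defaultdict(set)
    for a, b, _ in edges:
        adj[a].add(b)
        adj[b].add(a)
    best, visited = set(), set()
    for start in list(adj):
        if start in visited:
            continue
        comp, stack = {start}, [start]
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in comp:
                    comp.add(nxt)
                    stack.append(nxt)
        visited |= comp
        if len(comp) > len(best):
            best = comp
    return [e for e in edges if e[0] in best]

print(build_1k_graph([(2, 0, 1), (0, 1, 1), (0, 1, 0)], random.Random(0)))
```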

Figure 5: Overview of rescaling univariate and multivariate distributions.



The last step deals with the known problem of the pseudograph approach.
As its name suggests, it does not necessarily produce simple connected
graphs. In general, it generates pseudographs, i.e., graphs with
(self-)loops, consisting of several connected components. The size of
the largest connected component is usually comparable with the total
pseudograph size, while all others are small. Extraction of this
largest connected component and removal of all (self-)loops (see
footnote 6) alter the target degree distributions. Therefore, the
resulting simple connected graph has a slightly different ADD than the
one at the algorithm input.

Annotations alleviate this problem since they introduce 
a series of additional constraints. For example, in the non- 
annotated case, loops tend to form between high-degree ASs, 
simply because these ASs have a lot of stubs attached to 
them after step 1 of the algorithm. In the annotated case, 
the number of such loops is smaller because most stubs at- 
tached to high-degree ASs are annotated as provider stubs 
that can be matched only with customer stubs attached 
mostly to low-degree ASs. 

Still, the 1K-annotated random graphs are not perfect as, for example,
p2p links might end up connecting nodes with drastically dissimilar
degrees, cf. the JDD discussion in Section 4.2. The 2K-annotated random
graphs do not have this problem.

5.4.2 Constructing 2K-annotated random graphs 

Earlier work [29] extends the pseudograph approach to non-annotated
2K-distributions. We extend it even further for the 2K-annotated case
in the following algorithm:

1. for each vector $\mathbf{q}_j = (q_1^j, q_2^j, q_3^j)$, prepare a
node with $q_1^j$ customer stubs, $q_2^j$ provider stubs, $q_3^j$ peer
stubs, and total degree $q^j = |\mathbf{q}_j|_1$;

2. determine the total numbers $n_{c2p}$ and $n_{p2p}$ of c2p and p2p
edges in the target graph as the maximum possible number of
customer-and-provider and peer-and-peer stubs that can be matched
within the stub collection $\mathbf{q}_j$;

3. rescale the c2p and p2p JDD copulas (see footnote 7) to target sizes
of $n_{c2p}$ and $n_{p2p}$ degree pairs $(q, q')$ corresponding to c2p
and p2p edges between nodes of total degrees q and q' in the target
graph;

4. for each c2p (or p2p) degree pair $(q, q')$ select randomly a
customer (or peer) stub attached to a node of degree q and a provider
(or peer) stub attached to a node of degree q' and form a c2p (or p2p)
edge;

5. use the procedure described below to rewire (self-)loops;

6. remove unmatched stubs, remaining (self-)loops, and extract the
largest connected component.

6 Self-loops are removed, while multiple edges between the same pair of
nodes are mapped to a single edge between the two nodes.

7 See the remark at the end of Section 5.3.2.

The following rewiring procedure reduces the number of edges removed
from the final graph. For each edge involved in a (self-)loop between
nodes of degrees $q_1$ and $q_2$, we randomly select two non-adjacent
nodes of degrees $q_1$ and $q_2$ and move the edge to these nodes. This
procedure retains a large number of edges that would otherwise be
removed from the graph. In theory, this procedure may skew the original
2K-summary statistics. In practice, however, it alters these statistics
negligibly.
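The selection of a rewiring target can be sketched in Python as
follows. This is an illustrative sketch only: the data structures
`nodes_by_degree` and `adjacent` and the function name are assumptions,
and the exhaustive enumeration of candidate pairs is chosen for clarity
rather than efficiency.

```python
import random

def pick_rewire_target(q1, q2, nodes_by_degree, adjacent, rng):
    """Pick two distinct, non-adjacent nodes of degrees q1 and q2.

    nodes_by_degree maps a target total degree to the list of nodes
    with that degree; adjacent maps a node to the set of its current
    neighbors. Returns None if no valid pair exists.
    """
    candidates = [
        (a, b)
        for a in nodes_by_degree.get(q1, [])
        for b in nodes_by_degree.get(q2, [])
        if a != b and b not in adjacent.get(a, set())
    ]
    return rng.choice(candidates) if candidates else None

nodes_by_degree = {2: [0, 1], 1: [2, 3]}
adjacent = {0: {2, 3}, 1: {2}, 2: {0, 1}, 3: {0}}
# Only nodes 1 and 3 have degrees (2, 1) and are not yet adjacent.
print(pick_rewire_target(2, 1, nodes_by_degree, adjacent, random.Random(0)))
```

Because the replacement endpoints have the same degrees as the original
ones, moving the edge leaves the degree pair $(q_1, q_2)$ of that edge
unchanged.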

The resulting graph has both the ADD and JDDs approximately the same as
those obtained after rescaling. Minor discrepancies are due to the last
step of the algorithm, but the numbers of (self-)loops and small
connected components are even smaller than in the 1K-annotated case.
The reason for these improvements is the additional structural
constraints, compared with the 1K-annotated case. For example, the
JDD-induced constraints force the algorithm to create only one link
between a pair of high-degree nodes, or no links between a pair of
nodes of degree 1, thus avoiding creation of many connected components
composed of such node pairs. The original graph does not have such
links, and the rescaled JDDs preserve these structural properties, thus
improving the resulting graph quality.

6. EVALUATION 

In this section, we present the results of evaluating our 2K-annotated
graph generation method. We also evaluated the 1K-annotated generator
and found that, as expected, it produced less accurate graphs with
defects such as those mentioned in Section 4.2, e.g., with p2p links
connecting ASs of dissimilar degrees.

Experiments. To evaluate the accuracy of our 2K-annotated generator, we
want to compare graphs it produces with the measured annotated Internet
AS graph from Section 5.2. To simplify comparisons, we select the one
most representative graph from a set of 50 random synthetic graphs. We
select this most representative graph as follows. We first look for a
simple graph metric that exhibits high variability across the generated
graphs. One such metric is the maximum degree. The expected maximum
degree in an n-node graph with a power-law degree distribution
$P(k) \sim k^{-\gamma}$ is $k_{\max} \sim n^{1/(\gamma - 1)}$ [4].
Exponent $\gamma$ is approximately 2.1 for the Internet AS topology.
This value of $\gamma$ stays constant as the Internet grows, and it
implies almost linear scaling of the maximum degree since
$1/(\gamma - 1) \approx 0.9$, which is consistent with scaling of
maximum degree in historical Internet topologies [36]. For these
reasons, our most representative graph is the one with its maximum
degree closest to its expected value, across all the generated graphs.
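As a worked check of the arithmetic behind the "almost linear" scaling,
the exponent and the effect of doubling the graph size can be written
out as follows (the doubling example is an illustration added here,
using only the scaling law and the value of the exponent quoted above):

```latex
\frac{1}{\gamma - 1} = \frac{1}{2.1 - 1} = \frac{1}{1.1} \approx 0.91 ,
\qquad
\frac{k_{\max}(2n)}{k_{\max}(n)} \sim 2^{1/(\gamma - 1)}
  \approx 2^{0.91} \approx 1.88 .
```

That is, doubling the number of nodes is expected to increase the
maximum degree by somewhat less than a factor of two.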

In addition, we evaluate the variance of important graph 
metrics described below, across ensembles of random graphs 
that we generate. Studying the variance properties of a 
graph generation technique is essential for estimating struc- 
tural differences between equal-sized random graphs gen- 
erated by the model, and for gaining insight on how such 
differences affect performance evaluation experiments. The
variance properties of a graph generation technique are
associated with the following tradeoff. On the one hand,
variance should be small so that generated graphs closely match
the observed topology. On the other hand, though, random 
graphs should not all be identical or almost identical, be- 
cause if they do not exhibit sufficient structural diversity, 
then they have little value for performance evaluation stud- 
ies. In our experiments, we compute and report the variance 
of important graph metrics in sets of 50 equal-sized random 
graphs. 

Metrics. Since it is practically impossible to compare graphs over
every existing graph metric, we select a set of metrics that were found
particularly important in the Internet topology literature. These
metrics include the degree distributions that we deal with in previous
sections, assortativity coefficient, distance distribution, and
spectrum. The assortativity coefficient is essentially the Pearson
correlation coefficient of the joint degree distribution (JDD). Its
positive (negative) values indicate that degrees of connected nodes are
positively (negatively) correlated, meaning that nodes with similar
(dissimilar) degrees interconnect with higher probabilities. The
distance distribution is the distribution of lengths of the shortest
paths in a graph, which we compute both with and without constraints
imposed by annotations (routing policies). The spectrum of a graph is
the set of the eigenvalues of its Laplacian L. The Laplacian's matrix
elements $L_{ij}$ are $-1/(k_i k_j)^{1/2}$ if there is an edge between
node i of degree $k_i$ and node j of degree $k_j$; 1 if i = j; and 0
otherwise. Among the n eigenvalues of L, the smallest non-zero and
largest eigenvalues are most interesting, since they provide tight
bounds to a number of important network properties. For more details on
these and other metrics, and why they are important, see [30].
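Two of these metrics are simple enough to sketch in a few lines of
Python. The sketch below is an illustration under stated assumptions:
the function names are hypothetical, the assortativity coefficient is
computed as the Pearson correlation of the degrees at the two ends of
an edge with every undirected edge counted in both directions, and the
Laplacian entries follow the definition given above. Eigenvalue
computation is left out.

```python
import math

def degrees(edges):
    """Map each node to its degree in an undirected edge list."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

def assortativity(edges):
    """Pearson correlation of the degrees at the two ends of an edge.

    Each undirected edge is counted in both directions. The result is
    undefined (division by zero) when all degrees are equal.
    """
    deg = degrees(edges)
    pairs = [(deg[a], deg[b]) for a, b in edges]
    pairs += [(y, x) for x, y in pairs]
    mean = sum(x for x, _ in pairs) / len(pairs)
    cov = sum((x - mean) * (y - mean) for x, y in pairs) / len(pairs)
    var = sum((x - mean) ** 2 for x, _ in pairs) / len(pairs)
    return cov / var

def laplacian_entry(i, j, edges, deg):
    """Entry L_ij of the normalized Laplacian defined in the text."""
    if i == j:
        return 1.0
    if (i, j) in edges or (j, i) in edges:
        return -1.0 / math.sqrt(deg[i] * deg[j])
    return 0.0

star = [(0, 1), (0, 2), (0, 3)]   # hub 0 connected to three leaves
print(assortativity(star))        # -1.0: hub and leaves are maximally dissimilar
print(laplacian_entry(0, 1, star, degrees(star)))
```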

Results. In Figure 6 we plot the ADs of the measured AS topology and of
the most representative synthetic graph of equal size. We observe that
the distributions of the customer, provider, and peer degrees in the
synthetic graph are very close to the corresponding distributions in
the measured topology. The close match demonstrates that: 1) spline
smoothing accurately models complex ADs of real Internet topologies,
2) random number generation yields empirical distributions that follow
the modeled distributions, and 3) rewiring and removal of (self-)loops
do not introduce any significant artifacts. It is, of course, expected
that our generator accurately reproduces ADs, as they are part of the
2K-summary statistics we explicitly model. We also confirm that
synthetic graphs, also as expected, closely reproduce all the other
summary statistics that we explicitly model: the DD, ADD, and JDDs of
the synthetic graph are very close to the originals. We do not show the
corresponding plots for brevity.

Figure 6: CCDFs of the number of customer, provider, and peer stubs in
the synthetic versus measured AS topology.

In Figures 7(a) and 7(b) we compare the distance distri- 
butions of the measured and equal-sized synthetic topology 
ignoring and accounting for annotation-induced, i.e., rout- 
ing policy, constraints. In the former case, we calculate 
lengths of the standard shortest paths between nodes in 
a graph as if the graph was non-annotated. In the latter 
case, we find lengths of shortest valid, i.e., valley-free, paths 
defined in Section 3. In both the non-annotated and anno- 
tated cases, we observe that the distance distribution in the 
synthetic graph closely matches the distance distribution in 
the measured topology, even though we have not explicitly 
modeled or tried to reproduce the distance distributions. 
We also observe that the distance distributions in the non- 
annotated and annotated cases are different, meaning that 
annotations in the synthetic graph properly filter realistic, 
policy-constrained paths from the set of all possible paths in
the non-annotated case.
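The policy-constrained distance computation can be sketched in Python
as a breadth-first search over (node, phase) states. The sketch assumes
the usual valley-free rule, namely that a valid path climbs zero or
more customer-to-provider links, crosses at most one peer link, and
then descends zero or more provider-to-customer links; the precise
definition used here is the one in Section 3. The dictionaries
`providers`, `customers`, and `peers` are hypothetical inputs mapping a
node to its neighbors of each kind.

```python
from collections import deque

def valley_free_distance(src, dst, providers, customers, peers):
    """Length of the shortest valley-free path, or None if none exists.

    Phase 0: still climbing (only customer-to-provider steps so far).
    Phase 1: past the peak (a peer or downhill step was taken); from
    here only provider-to-customer steps are allowed.
    """
    start = (src, 0)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node, phase = queue.popleft()
        if node == dst:
            return dist[(node, phase)]
        steps = [(c, 1) for c in customers.get(node, ())]      # downhill
        if phase == 0:
            steps += [(p, 0) for p in providers.get(node, ())]  # uphill
            steps += [(r, 1) for r in peers.get(node, ())]      # peer link
        for state in steps:
            if state not in dist:
                dist[state] = dist[(node, phase)] + 1
                queue.append(state)
    return None

# Node 2 is a customer of both 0 and 1: the path 0 -> 2 -> 1 is a valley.
providers = {2: {0, 1}}
customers = {0: {2}, 1: {2}}
print(valley_free_distance(0, 1, providers, customers, {}))   # None
print(valley_free_distance(2, 0, providers, customers, {}))   # 1
```

Ignoring annotations, nodes 0 and 1 in this toy example are two hops
apart, which illustrates how annotations remove paths that a plain
shortest-path computation would count.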

In Table 3 we compare the measured topology with syn- 
thetic graphs of different sizes over a set of important scalar 
metrics, including those we do not explicitly model or try 
to reproduce, e.g., the eigenvalues of the Laplacian, etc. We 
compute these metrics for five synthetic graphs of sizes 5, 000, 
10, 000, 30, 000, and 19, 036 nodes, the last size being equal 
to the size of the original topology. The first three metrics 
are the number of (c2p or p2p) edges in a graph. We ob- 
serve that the number of such edges grows almost linearly 
with the number of nodes. This observation is consistent 
with that the average degree in historical Internet topologies 
stays almost constant [36]. The fourth metric is the maxi- 
mum degree. As expected, the maximum degree grows with 
the size of the graph slightly slower than linearly. The next 
five metrics in the table describe properties that have stayed 
relatively constant in historical Internet topologies. These 
properties have small variations in the synthetic graphs as 
well. 

Figure 7: Distance distributions in synthetic and 
measured AS topologies. 
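
The link between near-linear edge growth and a nearly constant average degree follows from the identity: average degree = 2 x (number of edges) / (number of nodes). The short Python check below applies it to edge counts from Table 3. The pairing of each edge count with a graph size follows the order in which the values appear in the table and is an assumption of this sketch; the helper name is hypothetical.

```python
def average_degree(num_nodes, num_edges):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2.0 * num_edges / num_nodes


# (number of nodes, number of edges); the pairing is assumed from the
# order of the values in Table 3, not stated by a column header.
graphs = {
    "measured": (19036, 40115),
    "synthetic-5000": (5000, 10179),
    "synthetic-10000": (10000, 20730),
}

avg = {name: round(average_degree(n, e), 2) for name, (n, e) in graphs.items()}
print(avg)
```

Under this pairing the computed values agree with the average degrees listed in Table 3 (4.21, 4.07, and 4.15), which is what one expects if the edge count scales almost linearly with the number of nodes.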

Next we investigate the benefit of modeling the ADs and 
ADD in addition to the DD and JDDs. Previous work [29] 
shows that modeling DD and JDD is sufficient for captur- 
ing and reliably reproducing most important non-annotated 
graph metrics. The main value of modeling ADs and ADD 
is that our generated synthetic graphs are properly annotated. 
We saw in Section 3 that the Internet topology annotations 
are important. Here we provide further evidence 
that they are non-trivial. Specifically, in Figures 8(a), 8(b), 
and 8(c) we plot the total degrees of ASs in the measured 
AS topology versus their annotation degrees: the numbers of 
customers, providers, and peers, respectively. We 
observe that a given total degree can correspond to a wide 
range of different numbers of customers, providers, and 
peers. The JDD provides 
information only on total node degrees and on their 
correlations, whereas it is completely agnostic to annotation 
degrees. On the other hand, the ADs and ADD capture the 
distribution of annotation degrees and the correlations be- 
tween annotation degrees, respectively. Therefore, the JDD 
alone is in principle incapable of capturing topology anno- 
tations, while the benefit of modeling ADs and ADD lies in 
reproducing realistic annotations in generated graphs. 
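
To make the distinction between total and annotation degrees concrete, the following minimal Python sketch counts customers, providers, and peers per node from an annotated edge list. The node names, the link lists, and the function names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def annotation_degrees(c2p_links, p2p_links):
    """Per-node counts of customers, providers, and peers."""
    deg = defaultdict(lambda: {"customers": 0, "providers": 0, "peers": 0})
    for customer, provider in c2p_links:
        deg[customer]["providers"] += 1
        deg[provider]["customers"] += 1
    for a, b in p2p_links:
        deg[a]["peers"] += 1
        deg[b]["peers"] += 1
    return {node: dict(counts) for node, counts in deg.items()}


def total_degree(counts):
    """Total degree is the sum of the three annotation degrees."""
    return sum(counts.values())


# Hypothetical toy data: X has three customers, Y has one provider and two peers.
degrees = annotation_degrees(
    c2p_links=[("a", "X"), ("b", "X"), ("c", "X"), ("Y", "p")],
    p2p_links=[("Y", "q"), ("Y", "r")],
)
print(total_degree(degrees["X"]), degrees["X"])
print(total_degree(degrees["Y"]), degrees["Y"])
```

Both X and Y have total degree 3, yet their annotation degrees differ completely. A statistic that records only total degrees cannot tell such nodes apart, which is the point made by the scatterplots in Figure 8.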

To quantify the variance properties of randomly generated 
graphs, we compute the standard deviation of our metrics 
across sets of 50 random graphs. We construct 4 sets with 
topologies of 5,000, 10,000, 19,036, and 30,000 nodes, a total 
of 200 random graphs. Among our evaluation metrics, we 
do not compute the eigenvalues of the Laplacian and the as- 
sortativity coefficient, since they require prohibitively long 
computation times for 200 graphs. In Table 4, we show the 
standard deviation and mean value of the remaining metrics. 



graph metric    std. deviation / mean 
                5000 nodes    10000 nodes   19036 nodes   30000 nodes 
c2p edges       399/17,905    267/36,876    349/71,055    410/112,549 
p2p edges       100/1,238     203/2,980     396/6,412     541/12,863 
max degree      387/1,618     417/2,090     471/2,335     376/2,599 
av. degree      0.08/3.83     0.03/3.99     0.02/4.07     0.02/4.11 
av. distance    0.13/3.16     0.09/3.40     0.10/3.61     0.06/3.77 


The maximum degree exhibits the highest standard devia- 
tion (with respect to the mean) taking values between 376 
and 471 for graphs of different sizes. The high variance of the 
maximum degree is expected, since the degree distribution 
of Internet topologies is highly skewed. On the other hand, 
the remaining metrics in Table 4 exhibit low variance. These 
metrics reflect aggregate graph properties and can be mod- 
eled as a sum of many i.i.d. random variables. Therefore, 
according to the central limit theorem, their distribution 
is approximately normal and their variance is consequently 
smaller than the variance of the maximum degree. 
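
The comparison of standard deviations relative to the means can be checked directly from the entries of Table 4. The Python sketch below divides each standard deviation by the corresponding mean and reports, for every graph size, the metric with the largest ratio. The variable and function names are hypothetical; the numbers are those of Table 4.

```python
# Table 4 entries as (standard deviation, mean), one pair per graph size.
SIZES = [5000, 10000, 19036, 30000]
TABLE4 = {
    "c2p edges":    [(399, 17905), (267, 36876), (349, 71055), (410, 112549)],
    "p2p edges":    [(100, 1238), (203, 2980), (396, 6412), (541, 12863)],
    "max degree":   [(387, 1618), (417, 2090), (471, 2335), (376, 2599)],
    "av. degree":   [(0.08, 3.83), (0.03, 3.99), (0.02, 4.07), (0.02, 4.11)],
    "av. distance": [(0.13, 3.16), (0.09, 3.40), (0.10, 3.61), (0.06, 3.77)],
}


def relative_std(std, mean):
    """Standard deviation expressed as a fraction of the mean."""
    return std / mean


most_variable = {}
for column, size in enumerate(SIZES):
    ratios = {m: relative_std(*cells[column]) for m, cells in TABLE4.items()}
    most_variable[size] = max(ratios, key=ratios.get)
    print(size, {m: round(r, 3) for m, r in ratios.items()})
```

For all four sizes the maximum degree has the largest ratio, roughly 0.14 to 0.24, while the ratios of the aggregate metrics stay below 0.1.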

An important difference between the graph generation 
method described in this study and the graph generation 
methods described in our previous work [29] is that the for- 
mer exhibits higher variance. The two methods are concep- 
tually similar in generating synthetic graphs that reproduce 
the correlation profile of an observed topology, although [29] 
does not consider annotations. They differ in that our previous 
techniques directly use the degree distribution or correlations 
of an observed topology to generate new, similar 
topologies. The present work, on the other hand, first models 
the degree correlations of a topology and then uses random 
number generators to produce synthetic degree distributions 
that are fed into the final graph constructors. In other words, 
our present technique induces more randomness by means 
of the synthetic generation of degree correlations based on 
the correlation profile extracted from the real topology. The 
two approaches are complementary and together provide a 
wider range of options for generating synthetic topologies 
with desired variance characteristics. 
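
The source of the extra variance can be seen in a deliberately simplified Python sketch. It is not either of the generators discussed above: it only contrasts reusing an observed degree sequence with resampling a sequence from the empirical degree distribution, and it tracks one aggregate quantity, half the sum of degrees, as a stand-in for the number of edges. The degree sequence, the function names, and the number of runs are all hypothetical.

```python
import random
import statistics


def reuse_degrees(observed, rng):
    """Direct approach: keep the observed degree sequence unchanged."""
    return list(observed)


def resample_degrees(observed, rng):
    """Model-then-sample approach: draw every degree independently
    from the empirical distribution of the observed sequence."""
    return [rng.choice(observed) for _ in observed]


def edge_count_std(make_degrees, observed, runs=200, seed=1):
    """Standard deviation of (sum of degrees) / 2 over repeated runs."""
    rng = random.Random(seed)
    counts = [sum(make_degrees(observed, rng)) / 2 for _ in range(runs)]
    return statistics.pstdev(counts)


# Hypothetical skewed degree sequence: many low-degree nodes, a few hubs.
observed = [1] * 60 + [2] * 25 + [5] * 10 + [20] * 4 + [100]

std_reuse = edge_count_std(reuse_degrees, observed)
std_resample = edge_count_std(resample_degrees, observed)
print(std_reuse, std_resample)
```

With the observed sequence reused, the aggregate is identical in every run, so its standard deviation is zero. Resampling makes it fluctuate from run to run, mostly because the rare high-degree values may be drawn several times or not at all.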

Overall, our evaluation results show that: 

• 2K-annotated random graphs generated with our approach 
faithfully reproduce a number of important properties 
of Internet topologies; 

• rescaled graphs exhibit the expected behavior according 
to a number of definitive graph metrics, i.e., these 
metrics are either properly rescaled or stay relatively 
stable as the size of synthetic graphs varies; 

• the profile of correlations between annotation and total 
degrees is diverse; and 

• random graphs generated with our method exhibit small 
variance, although higher than in our previous work [29]. 

Table 3: Scalar metrics of synthetic and collected graphs. Note that smallest eigenvalues are positive, but 
some may round to zero. 

Number of edges: 40115, 10179, 20730, ... 
Number of c2p edges: 36188, ..., 36146, 56870 
Number of p2p edges: 3927, 770, 1813, 3448, 5983 
Maximum degree: 2384, 1014, 1492, 2385, 3461 
Average degree: 4.21, 4.07, 4.15, 4.16, 4.19 
Assortativity coefficient: -0.20, -0.30, -0.24, -0.25, -0.18 
Largest eigenvalue of Laplacian: 1.97, ..., 1.88, 1.91, 1.92 
Smallest eigenvalue of Laplacian: ..., 0.00, ..., 0.00 
Average distance: ... 

Figure 8: Scatterplots demonstrating the diversity of annotation degrees and that total node degrees are 
agnostic with respect to annotations. 

7. CONCLUSIONS 

In this work, we have focused on the problem of generating 
synthetic annotated graphs that model real complex 
networks. Our techniques are likely to have many applications 
not only in networking, but also in other disciplines 
where annotated graphs are used to abstract and represent 
network structure. For example, two groups have recently 
contacted us to discuss our techniques as they were searching 
for tools to generate synthetic, semantic-rich, i.e., annotated, 
networks for their simulation studies. The first group 
works on modeling the European powerline networks, while 
the second is in brain and neural network research. Other 
networks to which our techniques are immediately applicable 
include the router-level Internet, WWW, networks of critical 
resource dependencies, as well as many types of social 
and biological networks, such as regulatory pathways [35]. 

A number of open problems remain. In particular, our 
techniques construct synthetic versions of real topologies 
available from measurement projects. However, it is well- 
known that in many cases, the outcome of measurements 
does not accurately represent a real complete topology. In 
fact, there might exist inherent limitations in measuring certain 
network topologies with 100% accuracy. An avenue for 
further research is the development of prediction techniques 
that extrapolate from what we can presently measure in order to 
predict what we cannot measure. 

Another substantial problem is the difficulty of validating 
the results of topology inference studies. For example, in 
the specific context of Internet topologies, validation is hard 
because of the unwillingness of service providers to release 
data on their infrastructure, network design, configuration, 
and performance. On the other hand, validation of any research 
result is a cornerstone of its reliability and utility. 
Therefore we believe it is imperative to focus on new valida- 
tion techniques that would combine the limited ground truth 
data available today with convincing testbed or simulation 
experiments. 